Data loading
2025-10-31
data_loading.RmdTIRTLseq text file data structure
TIRTLseq data is kept in tab-separated text files (.tsv). The files
may be ‘gzipped’ (.tsv.gz) to save space.
Output for each sample
consists of 3 files:
-
<sample-name>_TIRTLoutput.tsv(.gz)– Paired TCRs – a table of computationally paired alpha/beta TCRs -
<sample-name>_pseudobulk_TRA.tsv(.gz)– Alpha chain “pseudo-bulk” data – average read counts/frequencies of alpha-chains across all wells on a plate -
<sample-name>_pseudobulk_TRB.tsv(.gz)– Beta chain “pseudo-bulk” data – average read counts/frequencies of beta-chains across all wells on a plate
Loading TIRTLseq Data in R
The load_tirtlseq() function loads paired TCR TIRTLseq
data from all text files in a directory.
The function can also automatically assemble a metadata table if the files are named in a way that specifies the sample metadata.
Here, our files are named
(marker)_(timepoint)_(version)_ ... .tsv.gz, e.g.
cd4_tp1_v2_TIRTLoutput.tsv.
## Error in system("nvidia-smi", intern = TRUE, ignore.stderr = TRUE) :
## error in running command
library(dplyr)
library(rmarkdown)
folder = system.file("extdata/SJTRC_TIRTL_seq_longitudinal", package = "TIRTLtools")
dir(folder)## [1] "cd4_tp1_v2_pseudobulk_TRA.tsv.gz" "cd4_tp1_v2_pseudobulk_TRB.tsv.gz"
## [3] "cd4_tp1_v2_TIRTLoutput.tsv.gz" "cd4_tp2_v2_pseudobulk_TRA.tsv.gz"
## [5] "cd4_tp2_v2_pseudobulk_TRB.tsv.gz" "cd4_tp2_v2_TIRTLoutput.tsv.gz"
## [7] "cd4_tp3_v2_pseudobulk_TRA.tsv.gz" "cd4_tp3_v2_pseudobulk_TRB.tsv.gz"
## [9] "cd4_tp3_v2_TIRTLoutput.tsv.gz" "cd8_tp1_v2_pseudobulk_TRA.tsv.gz"
## [11] "cd8_tp1_v2_pseudobulk_TRB.tsv.gz" "cd8_tp1_v2_TIRTLoutput.tsv.gz"
## [13] "cd8_tp2_v2_pseudobulk_TRA.tsv.gz" "cd8_tp2_v2_pseudobulk_TRB.tsv.gz"
## [15] "cd8_tp2_v2_TIRTLoutput.tsv.gz" "cd8_tp3_v2_pseudobulk_TRA.tsv.gz"
## [17] "cd8_tp3_v2_pseudobulk_TRB.tsv.gz" "cd8_tp3_v2_TIRTLoutput.tsv.gz"
ts_data = load_tirtlseq(folder, meta_columns = c("marker", "timepoint", "version"), sep = "_")## 64.076 sec elapsed
## these files are named (marker)_(timepoint)_(version)_etc.tsv.gzThe TIRTLseq data object
In R, TIRTLseq data is stored in a “list” which can contain any type of unstructured data.
The list contains two slots.
-
meta- a data frame with sample metadata -
data- a list with data for each sample
class(ts_data) ## list## [1] "list"
names(ts_data) ## [1] "data" "meta"## [1] "data" "meta"
The metadata table
The $meta slot contains a data frame with columns for
sample_id (determined from the filename) and any columns
you specified with the “meta_columns” argument. The label
column combines all of the metadata columns into one string.
If you did not specify “meta_columns” in the
load_tirtlseq() call, then the table will only contain the
sample_id and label columns.
paged_table(ts_data$meta)The data list
The $data slot is also a “list” that contains one object
for each sample, named by its sample_id.
For each sample, $data$[sample-name] is a “list” of
three data frames.
-
$data$[sample-name]$alphacontains the pseudo-bulk data for the alpha chain -
$data$[sample-name]$betacontains the pseudo-bulk data for the beta chain -
$data$[sample-name]$pairedcontains alpha and beta chains that were paired computationally by our MAD-HYPE and T-SHELL algorithms.
names(ts_data$data) ## [1] "cd4_tp1_v2" "cd4_tp2_v2" "cd4_tp3_v2" "cd8_tp1_v2" "cd8_tp2_v2" "cd8_tp3_v2"## [1] "cd4_tp1_v2" "cd4_tp2_v2" "cd4_tp3_v2" "cd8_tp1_v2" "cd8_tp2_v2"
## [6] "cd8_tp3_v2"
class(ts_data$data$cd4_tp1_v2) ## list## [1] "list"
names(ts_data$data$cd4_tp1_v2) ## [1] "alpha" "beta" "paired"## [1] "alpha" "beta" "paired"
The pseudo-bulk data frames
In the TIRTLseq assay, one experiment is repeated hundreds of times in a multi-well plate. Generally, an experiment is repeated in a half plate (192 wells) or a whole plate (384 wells). Alpha and beta TCRs are sequenced simultaneously, but separately in each well.
The reads for each well are processed using MiXCR, which calls V and J segments, assembles the CDR3 nucleotide sequence, and generates its corresponding amino acid sequence.
For each clonotype defined by a unique CDR3 nucleotide sequence, the following summary metrics are reported:
-
readCount- sum of read counts over all wells for the clonotype -
readFraction- the fraction of all reads attributed to the clonotype -
sem- standard error of the mean for the read fraction. -
n_wells- the number of wells in which a clonotype is observed -
max_wells- the number of wells used for the experiment (generally the same for all clonotypes) -
readCount_max- the maximum read count in any well for the clonotype -
readCount_median- the median read count for a clonotype
$data$[sample-name]$alpha and
$data$[sample-name]$beta contain the pseudo-bulk data for
alpha and beta chains, respectively.
paged_table(ts_data$data$cd4_tp1_v2$alpha)
paged_table(ts_data$data$cd4_tp1_v2$beta)The paired data frame
We use computational algorithms based on occurrence patterns (MAD-HYPE algorithm) or correlation of read fractions (T-SHELL) of TCR chains across all wells to identify alpha and beta chains present in the same clone.
In general, the MAD-HYPE method works well for moderately frequent clones, but fails for the most frequent clones if they are found in all or almost all wells. The T-SHELL algorithm works well for pairing these very frequent clones.
For each pair, we report:
V/J/CDR3 sequences
-
alpha_nuc_seq,alpha_nuc- alpha CDR3 nucleotide sequence -
beta_nuc_seq,beta_nuc- beta CDR3 nucleotide sequence -
alpha_beta- alpha-beta pair concatenated nucleotide sequence -
cdr3a- alpha CDR3aa sequence -
va- alpha V gene -
ja- alpha J gene -
cdr3b- beta CDR3aa sequence -
vb- beta V gene -
jb- beta J gene
Alpha/beta chain occurence and overlap
-
wa- number of wells where alpha is found -
wb-number of wells where beta is found -
wi- number of wells where alpha is found, but not beta -
wj- number of wells where beta is found, but not alpha -
wij- number of wells where both alpha and beta are found
Pairing method scores and metrics
-
method- method used for pairing: T-SHELL or MAD-HYPE -
r- for T-SHELL, the Pearson correlation coefficient between alpha and beta frequencies -
ts- for T-SHELL, the t-statistic of the correlation -
pval- for T-SHELL, the p-value of the correlation -
pval_adj- for T-SHELL, the adjusted p-value of the correlation -
score- for MAD-HYPE, the score of the pairing (higher is better) -
loss_a_frac- the fraction of wells with lost alpha chain -
loss_b_frac- the fraction of wells with lost beta chain
paged_table(ts_data$data$cd4_tp1_v2$paired)Note: Some TCR pairs will be listed twice in the $paired
data frame, once for the “MAD-HYPE” method and once for the “T-SHELL”
method.
ts_data$data$cd4_tp1_v2$paired %>%
filter(alpha_beta == "TGTGCCGTGAACGGAGGTAACGACTACAAGCTCAGCTTT_TGTGCCAGCTCACCAGAGCAGGGTCTGGCCCAGCATTTT") %>%
select(alpha_nuc, beta_nuc, method, everything()) %>%
paged_table()To see only unique TCR pairs, you may do the following:
pairs = ts_data$data$cd4_tp1_v2$paired
pairs_unique = pairs[!duplicated(pairs$alpha_beta),]
paged_table(pairs_unique)